CohortFinder: an open-source tool for data-driven partitioning of digital pathology and imaging cohorts to yield robust machine-learning models

Batch effects (BEs) are systematic technical differences in data collection that are unrelated to biological variation, and the noise they introduce has been shown to negatively impact the generalizability of machine learning (ML) models. Here we release CohortFinder (http://cohortfinder.com), an open-source tool aimed at mitigating BEs via data-driven cohort partitioning. We demonstrate that CohortFinder improves ML model performance in downstream digital pathology and medical image processing tasks. CohortFinder is freely available for download at cohortfinder.com.


Quantitative result
In the external testing set (n=25 patients), the best overall F1 score was obtained with BC partitioning (0.95±0.03), followed by AC (0.94±0.04) and finally WC (0.93±0.09), with statistically significant differences between WC and AC (p<0.01) as well as between WC and BC (p<0.01). While no statistically significant difference was observed between AC and BC (p=0.71), AC resulted in a larger range of F1 scores (see the violin plot in Figure 2-A), a lower average F1 value, and a higher standard deviation in F1 scores compared to BC, suggesting less robust performance for AC. The table in Figure 2-A reports the overall average ± standard deviation on external testing for all measurements across the 3 use cases.

Qualitative result.
In Figure 2-A and Supplementary Figure 7-A, WC partitioning results in a relatively higher number of false negative (FN) areas (overlaid green regions) in comparison to AC and BC. Additionally, AC yielded a marginally higher number of false positive (FP) and FN regions (highlighted in fuchsia and green, respectively) when compared to BC.

Quantitative result
In the external testing dataset (n=21 patients), the F1 score was significantly higher for BC than for WC (0.87±0.11 vs 0.64±0.32, p<0.01) and for AC than for WC (0.81±0.21 vs 0.64±0.32, p<0.01). Though no significant difference was found between BC and AC (0.87±0.11 vs 0.81±0.21, p=0.09), the violin plots in Figure 2-B and Supplementary Figure 8 suggest that BC has a more compact F1 score distribution, a higher average F1 score, and a lower standard deviation compared to AC.

Qualitative result.
In Figure 2-B and Supplementary Figure 7-B, classification heatmaps produced via BC partitioning exhibit the highest degree of similarity with the ground-truth mask. WC partitioning resulted in a significant underprediction of the tumor area, with a considerable number of false negative (FN) patches within normal tissue. AC partitioning yielded a slightly smaller prediction of tumor area when compared to BC.

Quantitative result
In the external testing dataset (n=10 patients), BC models resulted in the highest overall F1 score of 0.68±0.20, while the AC and WC models yielded significantly lower overall F1 scores of 0.63±0.23 (p<0.01 vs BC) and 0.62±0.20 (p<0.01 vs BC), respectively (Figure 2-C shows these measurements for the rectal cancer segmentation task). The markedly higher standard deviation in F1 scores of the WC models is illustrated in the violin plots of Supplementary Figure 8-C. Notably, the bottom tails of the F1 score distribution for WC models (green) are longer and wider than those of the AC (red) and BC (blue) models. This suggests that tumor segmentations by WC models were more varied and shared little overlap with expert annotations, resulting in marked variations in model performance compared to AC and BC. BC tumor segmentations consistently overlap with expert delineations, while AC models appear to slightly over-segment the tumor region. By comparison, the WC models show more varied performance, either under-segmenting or over-segmenting the tumor.

S3. Batch-effect severity evaluation (BE score)
To assist users in quantifying the severity of batch effects, we conducted a preliminary evaluation of three clustering metrics that measure how well the detected BE-groups are segregated: (a) the Silhouette coefficient, (b) the Davies-Bouldin index, and (c) the Calinski-Harabasz index [9]. These scores are now reported both in the CohortFinder output files and in the user interfaces of MRQy and HistoQC. Initial experimental evaluation of the BE score was conducted using (a) the entire cohort (where significant BEs may be expected because it is multi-institutional) and (b) site D5 alone (where fewer or minimal BEs are likely because it is uni-institutional). Our preliminary results show that cohorts with more severe BEs exhibit higher BE scores (see Supplementary Figure 9). In future work, we will investigate the impact of these metrics on downstream applications, such as the selection of k (i.e., the number of BE-groups).
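To make the three clustering metrics concrete, the following is a minimal sketch of their textbook definitions in plain Python, applied to points that already carry a BE-group label. It is not CohortFinder's implementation: the function names, the use of Euclidean distance, and the toy 2-D coordinates are assumptions made for illustration only. Under these definitions, better-separated groups give a higher Silhouette coefficient, a higher Calinski-Harabasz index, and a lower Davies-Bouldin index.

```python
import math
from collections import defaultdict


def _group(points, labels):
    """Collect points by their (hypothetical) BE-group label."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return groups


def _centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton groups contribute 0."""
    groups = _group(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = groups[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same group
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other group
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in groups.items() if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Davies-Bouldin index (lower means better-separated groups)."""
    groups = _group(points, labels)
    cents = {k: _centroid(v) for k, v in groups.items()}
    scatter = {
        k: sum(math.dist(p, cents[k]) for p in v) / len(v)
        for k, v in groups.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in groups if j != i
        )
        for i in groups
    ]
    return sum(worst) / len(worst)


def calinski_harabasz(points, labels):
    """Calinski-Harabasz index (higher means better-separated groups)."""
    groups = _group(points, labels)
    n, k = len(points), len(groups)
    overall = _centroid(points)
    between = sum(
        len(v) * math.dist(_centroid(v), overall) ** 2 for v in groups.values()
    )
    within = sum(
        math.dist(p, _centroid(v)) ** 2 for v in groups.values() for p in v
    )
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two tight groups, labeled correctly and then arbitrarily.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
separated = [0, 0, 0, 0, 1, 1, 1, 1]
mixed = [0, 1, 0, 1, 0, 1, 0, 1]

if __name__ == "__main__":
    for name, labels in (("separated", separated), ("mixed", mixed)):
        print(name,
              round(silhouette(toy_points, labels), 3),
              round(davies_bouldin(toy_points, labels), 3),
              round(calinski_harabasz(toy_points, labels), 3))
```

In practice, a library such as scikit-learn provides equivalent functions; the hand-written versions here are only meant to show what each score measures.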

Figure 2 -
Figure 2-C and Supplementary Figure 7-C depict representative tumor segmentations obtained via

Table 1.
List of use cases and associated experiments employed for the evaluation of CohortFinder. This table encompasses three distinct use cases: 1) Tubule segmentation within the NEPTUNE cohort (Pathology), 2) Classification of colon adenocarcinoma in the TCGA-COAD cohort (Pathology), and 3) Segmentation of rectal cancer using cohorts accrued from University Hospitals and Cleveland Clinic (Radiology-MRI).

Table 2.
Formulas for Quantitative Assessment Metrics. This table provides the mathematical expressions used to compute five key metrics for performance evaluation: Precision, Recall, Accuracy, Intersection Over Union (IOU), and the F1 Score. Definitions include False Positives (FP), False Negatives (FN), True Positives (TP), and True Negatives (TN).
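As a rough illustration of the five metrics named in the Table 2 caption, the sketch below computes them from TP, FP, FN, and TN counts using their standard definitions. The table body is not reproduced here, so these are assumed to match the formulas used in the paper. The function name `metrics_from_counts` and the choice to return 0.0 when a denominator is empty are assumptions made for this example.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Standard classification/segmentation metrics from confusion counts."""

    def safe_div(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "iou": safe_div(tp, tp + fp + fn),
        # F1 is the harmonic mean of precision and recall,
        # which simplifies to 2TP / (2TP + FP + FN).
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts, e.g. pixels of one segmented image.
    for name, value in metrics_from_counts(tp=8, fp=2, fn=2, tn=88).items():
        print(f"{name}: {value:.3f}")
```

Note that IOU and F1 ignore TN, which is why they are commonly preferred over accuracy when the background class dominates an image.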

Table 3.
Summary of performance measures for 3 different use cases, reported on both internal and external testing datasets. For the F1 score, the best performance is highlighted in red, and the worst performance is highlighted in blue.

Table 4.
Detailed performance measures for each experiment for all three use cases. For the F1 score, the best performance is highlighted in red, and the worst performance is highlighted in blue.

Table 5.
Acronyms and definitions utilized in the manuscript.
TN: True negative. This occurs when the model correctly predicts the absence of a feature or condition, such as correctly identifying that a segment of an image does not contain pathological tissue or accurately classifying an image as healthy.
TP: True positive. This is when the model correctly identifies a feature or condition present in the medical image, such as accurately recognizing a tumor in a segmentation task or correctly classifying an image as indicative of disease.
FN: False negative. This is when the model fails to identify a feature or condition that is actually present, such as not detecting a tumor that exists in a segmentation task or failing to classify a diseased image correctly.
FP: False positive. This happens when the model incorrectly identifies a feature or condition as present, such as mistakenly delineating a region as a tumor in a healthy tissue segment or classifying a healthy image as showing signs of disease.